Note: For the latest copy of this documentation, please see
the complete EasyPatterns reference
(online)
Notation: "..." (without the quotes) is any appropriate
text, keyword or expression.
The entire pattern is a string, so it is enclosed in double-quotes. With a
few exceptions, the quotation marks will not be shown in this section.
the simplest pattern is just literal text |
... |
e.g. "abc" |
EasyPattern keywords are enclosed in brackets |
[...] |
e.g. "[letter]" |
literal text & keywords can be combined |
... [...] |
e.g. "abc[digit]" matches "abc1", "abc2", etc. |
[...] ... |
e.g. "[digit]abc" matches "1abc", "2abc", etc. |
... [...] ... |
e.g. "[digit]abc[digit]" matches "1abc1", "1abc2",
"2abc1", etc. |
multiple keywords can appear together |
[...][...] |
e.g. [letter][digit] matches "a1", "b1", etc. |
[..., ...] |
e.g. [letter, digit] instead of [letter][digit] |
[... ...] |
e.g. [letter digit] instead of [letter][digit] |
-
Any "][" pair in a pattern (other than within a quoted string) can
be replaced with a comma (followed by an optional space). Any comma (and
optional space) in a pattern can be replaced by "][". The meaning is the same.
-
Any "][" pair or comma in a pattern (other than within a quoted
string) can be replaced with a space (actually, one or more spaces), any space
in a pattern can be replaced by a comma or "][". The meaning is the same.
literal text can be included inside a bracketed
expression using single quotes |
['literal'] |
e.g. "['abc']" instead of "abc" |
[... 'literal'] |
e.g. "[digit, 'abc']" instead of "[digit]abc" |
['literal' ...] |
e.g. "['abc', digit]" instead of "abc[digit]" |
[... 'literal' ...] |
e.g. "[digit, 'abc', digit]" instead of "[digit]abc[digit]" |
-
There is usually no difference in meaning between including
literal text within a bracketed expression (in single quotes) and leaving
literal text outside the brackets. The choice is a matter of individual
preference. One exception: when using [not], only the single-quoted literal will
work, e.g. [not '-'].
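For readers who already know conventional regular expressions, the combinations above can be approximated with Python's `re` module. This is only a rough sketch: it assumes [digit] behaves like the class [0-9], and the helper name `matches_fully` is made up for illustration. It says nothing about how EasyPattern itself is implemented.

```python
import re

# Assumed rough equivalents, for illustration only:
#   "abc[digit]"        -> abc[0-9]
#   "[digit]abc[digit]" -> [0-9]abc[0-9]
ABC_DIGIT = re.compile(r"abc[0-9]")
DIGIT_ABC_DIGIT = re.compile(r"[0-9]abc[0-9]")


def matches_fully(pattern, text):
    """Return True when the whole of text matches pattern (hypothetical helper)."""
    return pattern.fullmatch(text) is not None


print(matches_fully(ABC_DIGIT, "abc1"))         # True
print(matches_fully(ABC_DIGIT, "abcx"))         # False
print(matches_fully(DIGIT_ABC_DIGIT, "2abc1"))  # True
```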
The most important keywords represent character sets, that
is, any of a set of related characters.
anything, letters, digits, etc. |
[character], [char], [chars], [characters] |
All 256 chars (every character including NULL) |
[letter], [letters] |
includes é and ü, common in certain
European languages. |
[digit], [digits] |
decimal digits 0-9 |
[hexdigit] |
hexadecimal digits 0-9, a-f and A-F |
[punctuation] |
printing characters, excluding letters and digits, includes
!?.,:; " ' ' / - () {} - |
[symbol], [symbols] |
~@#$%^&* |
-
EasyPattern's [character] or [char] will match any character including
return. If you want any character except a return (or formfeed), use [paragraphChar];
that is, any character that could appear in a paragraph. Details below.
-
EasyPattern distinguishes punctuation from symbols; the sets do
not overlap. For broader combinations, see [printableChar] and [typewriterChar].
For narrower focus, see [sentencePunctuation], [anyQuote], [anyBracket] and [anyDash].
-
Note that « and » are considered punctuation.
special letters |
[upper], [uppercase], [uppercaseLetter] |
uppercase letters |
[lower], [lowercase], [lowercaseLetter] |
lowercase letters |
Reserved punctuation |
[leftBracket] |
[ |
[rightBracket] |
] |
[leftParen], [leftParenthesis] |
( |
[rightParen], [rightParenthesis] |
) |
[comma] |
, |
[singleQuote] |
' |
[doubleQuote], [quote] |
" |
(i.e. standard ASCII "straight" quotation mark) |
[backwardSingleQuote] |
` |
|
-
EasyPattern gives special meaning to certain punctuation marks;
these keywords can be used to represent the literal character.
-
Square brackets are "reserved" by EasyPattern. The other
punctuation marks listed here are literal when they appear outside of brackets.
-
A literal comma and left & right brackets & parentheses may appear
inside single quotes. The keywords are provided to make patterns easier to read.
There are many ways to create your own character sets to
match exactly the characters you require.
combining character sets with "or" |
[... or ...] |
e.g. [letter or digit], ['a' or 'b'] |
-
Most EasyPattern keywords refer to a set of characters from which one will
match. The first use of "or" is to make a larger set -- though again, only one
of the larger set will match. (Any quantity can be specified using repetition
keywords, but the quantity still applies to a single character, not to
multiple characters.)
-
When sets are combined with "or", parentheses are optional. (In
technical terms, the character set use of "or" has very high
precedence; see below)
"[letter, letter or digit, letter]" --
matches "aaa", "xyz", "h4q", "b7f" etc.
The commas are optional here too; the "or"
implicitly groups any character set keywords:
"[letter letter or digit letter]" -- same
as above
Of course, it doesn't hurt to add parentheses even
though they are not required.
"[letter (letter or digit) letter]" --
same as above
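As a rough comparison, the set form of "or" behaves like merging two character classes in a conventional regex. The sketch below assumes ASCII-only classes ([letter] in EasyPattern is broader, e.g. it includes é and ü), so treat it as an approximation rather than an exact translation.

```python
import re

# Assumed approximation of "[letter, letter or digit, letter]".
LETTER = "[A-Za-z]"
LETTER_OR_DIGIT = "[A-Za-z0-9]"  # "or" between sets yields one larger set
pattern = re.compile(LETTER + LETTER_OR_DIGIT + LETTER)

for text in ["aaa", "xyz", "h4q", "b7f", "4ab"]:
    # Only "4ab" fails: its first character is not a letter.
    print(text, bool(pattern.fullmatch(text)))
```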
negation |
[not ...], [non ...], [anyExcept ...] |
e.g. [oneOrMore non letter] |
-
Instead of specifying all the characters that could occur in a
match, it is often convenient to specify characters that could not occur.
EasyPattern has keywords for [quotedString] and [HTMLTag], but if it didn't,
they would be easy to define:
['<', oneOrMore not '>', '>'] « same as [HTMLTag]
[quote, oneOrMore not quote, quote] « a simple definition for [quotedString]
-
Negation can only be applied to a single character, or a character
set from which one will match. For example:
[not letter] « fine
[not letter or digit] « fine: [letter or digit] is a set from which one
will match
[not word] « ERROR: "word" matches multiple characters
[not 'a'] « fine (a single character)
[not 'whatever'] « ERROR
[not lineChar or letter] « CAREFUL! [lineChar] is defined as [not linefeed OR verticalTab OR formfeed OR return].
You cannot combine negated and non-negated character sets, so this pattern
is equivalent to [ (not lineChar) or letter ], instead of [ not (lineChar or
letter) ]
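The two negation-based definitions above correspond closely to negated character classes in conventional regex. The following sketch uses Python's `re` module; the regexes are assumed approximations, and the sample string is invented for illustration.

```python
import re

# Assumed approximations:
#   ['<', oneOrMore not '>', '>']        -> <[^>]+>
#   [quote, oneOrMore not quote, quote]  -> "[^"]+"
tag = re.compile(r"<[^>]+>")
quoted = re.compile(r'"[^"]+"')

sample = 'Click <a href="x.html">here</a> or say "hello".'
tags_found = tag.findall(sample)
quotes_found = quoted.findall(sample)
print(tags_found)    # ['<a href="x.html">', '</a>']
print(quotes_found)  # ['"x.html"', '"hello"']
```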
custom sets |
[<...>] |
e.g. [<aeiou>], [<135>], [<!@#$%^&*>] |
-
Keywords such as [letter] and [digit] are character sets defined
internally to EasyPattern; the angle bracket notation lets you define your own
character sets. In both cases, EasyPattern matches any single character in that
set.
-
For single characters, [or] and a set are interchangeable, e.g. [<aei>]
and ['a' or 'e' or 'i'] have the same meaning.
-
User defined sets, single character literals and EasyPattern
keywords can be combined with [or]:
[<aeiou> or <123> or '7' or symbol]
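A custom set is close to a bracketed class in conventional regex. One difference worth noting: in a regex class, characters such as ^, - and ] need escaping, whereas the angle bracket examples above list them directly. The helper `custom_set` below is a hypothetical illustration in Python, not part of EasyPattern.

```python
import re


def custom_set(chars):
    """Build a regex class matching any one character in chars.

    Hypothetical helper: re.escape guards characters that are special
    inside a regex class (such as ^, - and ]).
    """
    return "[" + "".join(re.escape(c) for c in chars) + "]"


vowel = re.compile(custom_set("aeiou"))      # like [<aeiou>]
symbols = re.compile(custom_set("!@#$%^&*"))  # like [<!@#$%^&*>]

print(bool(vowel.fullmatch("e")))    # True
print(bool(symbols.fullmatch("^")))  # True
print(bool(symbols.fullmatch("a")))  # False
```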
alternative patterns with "or" |
[... or ...] |
e.g. ['Player' or 'EasyPattern'] |
-
When "or" is used to specify alternatives as part of a larger
pattern, grouping parentheses are required, e.g.
"[space, 'Player' or 'EasyPattern', space]"
-- may not mean what you think!
"[space]Player[or]EasyPattern[space]" -- may not mean what you think!
"[(space, 'Player') or ('EasyPattern', space)]" -- that's what they mean
"[space ('Player' or 'EasyPattern') space]" -- this might be what you
wanted
Remember: as noted in the section on expressions,
commas are allowed between items to make patterns easier to read; they do not
affect what the pattern means.
-
If you leave out the parentheses, EasyPattern will treat everything to
the left of the "or" as one implicit group and everything to the right of the
"or" as a separate group. Note that visual grouping with brackets or commas is
not enough; you must use parentheses. For example, all of the following will be
interpreted as "[(digit, 'this') or ('that')]":
"[digit, 'this' or 'that']" « careful; the
commas may mislead
"[digit]['this' or 'that']" « careful; the brackets may mislead
"[digit 'this' or 'that']" « the grouping isn't clear;
parentheses would help
"[digit]this[or]that" « the grouping isn't clear; parentheses would help
As noted in the previous section, parentheses are not
required when "or" is used to combine character sets.
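The difference between the implicit grouping and the parenthesized form can be seen with a conventional regex, where alternation also has the lowest precedence. The regexes below are assumed approximations of the two EasyPatterns, written for Python's `re` module.

```python
import re

# Assumed approximations of the two groupings discussed above:
#   "[(space, 'Player') or ('EasyPattern', space)]"  (implicit grouping)
#   "[space ('Player' or 'EasyPattern') space]"      (explicit parentheses)
implicit = re.compile(r"(?: Player)|(?:EasyPattern )")
explicit = re.compile(r" (?:Player|EasyPattern) ")

text = "the EasyPattern language"
print(repr(implicit.search(text).group()))  # 'EasyPattern '
print(repr(explicit.search(text).group()))  # ' EasyPattern '
```

The implicit form matches without the leading space because the space belongs only to the 'Player' alternative.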
"or" as set vs. "or" as alternative
In many cases, you don't have to worry that there are two
different uses for "or"; both generally make sense in context. However, there
are 2 reasons for learning the differences:
-
or as set doesn't require parentheses; the grouping is implied
-
or as set can be part of a "not" expression since it still
represents one character
Notation: "..." is any appropriate keyword or expression, #
is a number (one or more digits; the maximum varies with context).
repetition |
examples |
will match... |
[optional ...], [zeroOrOne ...] |
[digit, optional letter], [digit, zeroOrOne letter] |
2, 2a |
[0+ ...], [zeroOrMore ...] |
[digit, zeroOrMore letters] |
2, 2a, 2aa, 2aaa, 2aaaa... |
[1+ ...], [oneOrMore ...] |
[digit, oneOrMore letters] |
2a, 2aa, 2aaa, 2aaaa... |
[2+ ...], [many ...], [twoOrMore ...] |
[digit, many letters], [digit, twoOrMore letters] |
2aa, 2aaa, 2aaaa... |
[#+ ...] |
[digit, 5+ letters] |
2aaaaa, 2aaaaaa... |
- A space is not allowed, e.g. "one or more" will not be recognized
- The words are all special cases, e.g. "threeOrMore" will not work
(use "3+")
specific quantity, quantity range (where # is a
number) |
will match... |
[# ...] |
[5 letters] |
aaaaa, bbbbb |
[# to # ...] |
[3 to 5 letters] |
aaa, aaaa, aaaaa |
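The numeric quantity forms map naturally onto the brace quantifiers of conventional regex, with the quantity moved from prefix to suffix. The function `quantifier` below is a hypothetical helper that handles only the three numeric forms "#", "#+" and "# to #"; it ignores the shortest/longest distinction entirely.

```python
import re


def quantifier(spec):
    """Translate an EasyPattern-style prefix quantity into a regex suffix.

    Hypothetical helper covering only "#", "#+" and "# to #".
    """
    spec = spec.strip()
    m = re.fullmatch(r"(\d+)\+", spec)
    if m:
        return "{%s,}" % m.group(1)
    m = re.fullmatch(r"(\d+) to (\d+)", spec)
    if m:
        return "{%s,%s}" % (m.group(1), m.group(2))
    if re.fullmatch(r"\d+", spec):
        return "{%s}" % spec
    raise ValueError("unsupported quantity: %r" % spec)


print(quantifier("5+"))      # {5,}
print(quantifier("3 to 5"))  # {3,5}
print(quantifier("5"))       # {5}
```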
shortest vs. longest match |
[shortest ... ...] |
match the lowest possible number of repetitions (default) |
[longest ... ...] |
match the highest possible number of repetitions |
-
EasyPattern defaults to the SHORTEST match so the "shortest" keyword
is optional.
-
When the repetition or count includes a range of values to match, EasyPattern has the choice of matching the "shortest" sequence of characters
that fits the pattern, or the "longest" that fits the pattern. For example:
[shortest zeroOrOne ...] |
0 or 1 |
will try to match zero occurrences |
[shortest zeroOrMore ...] |
0+ |
will try to match zero occurrences |
[shortest oneOrMore ...] |
1+ |
will try to match one occurrence |
[shortest twoOrMore ...] |
2+ |
will try to match two occurrences |
-
In these cases, EasyPattern will only match more than the minimum
if required to complete additional parts of the pattern, e.g. given "abc123" and
the pattern "[shortest oneOrMore letter, digit]", EasyPattern will match "abc1", i.e. all
3 letters. However, given the same string and the pattern "[shortest oneOrMore
letter]", EasyPattern will just match "a", the first letter. Given the same string and
"[longest oneOrMore letter]", EasyPattern will match "abc". Note that EasyPattern always
starts with the first character that matches that pattern, e.g. despite "c1"
being shorter than "abc1", EasyPattern matches the latter.
-
Shortest/longest can be confusing.
-
Shortest can be quite slow; use "not" where possible.
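The "abc123" example can be reproduced with the lazy and greedy quantifiers of conventional regex. This Python sketch assumes that "shortest" corresponds to a lazy quantifier (+?) and "longest" to a greedy one (+); the results agree with the matches described above.

```python
import re

text = "abc123"

# Lazy (+?) approximates "shortest"; greedy (+) approximates "longest".
shortest_then_digit = re.search(r"[A-Za-z]+?[0-9]", text).group()
shortest_alone = re.search(r"[A-Za-z]+?", text).group()
longest_alone = re.search(r"[A-Za-z]+", text).group()

print(shortest_then_digit)  # abc1
print(shortest_alone)       # a
print(longest_alone)        # abc
```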
literals, groups
All of the repetition & quantity keywords can be applied to
literals and groups as well as to individual keywords, e.g.
[oneOrMore 'ab'] « matches "ab", "abab", "ababab"
etc.
[oneOrMore letter or digit] « matches "aaa", "456", "a45bbb" etc.
[oneOrMore not letter or digit] « matches punctuation, symbols,
whitespace etc.
[oneOrMore ('alpha' or 'omega')] « matches "alphaalpha", "alphaomega"
etc.
[oneOrMore (letter, digit)] « matches "r2", "r2d2", "r2d2f7b2c4" etc.
[(...)] |
A non-capturing group |
[capture(...)] |
Matching text is captured into 'group#' in the pattern, and into $# in
the replacement. # can range from 1 to 26 |
[group#] |
# can range from 1 to 26 e.g. [(letter)1, group1] « matches "ee", "bb",
"cc" etc |
[mustBeginWith(...) ...], [mustNotBeginWith(...) ...] |
When a match is found, it must be/must not be preceded by what is in the
brackets. The bracket contents are NOT included in the actual match. The
bracket contents are limited to fixed length strings - so no '3+' etc are
allowed. This must be the first part of your pattern. [mustBeginWith(
'hello' or 'goodbye' ) 'fred'] |
[... mustEndWith(...)], [... mustNotEndWith(...)] |
When a match is found, it must be/must not be followed by what is in the
brackets. The bracket contents are NOT included in the actual match. The
bracket contents are limited to fixed length strings - so no '3+' etc are
allowed. This must be the last part of your pattern. ['fred' mustEndWith(
'erick' or 'dy' ) ] |
-
Parentheses without trailing digits form groups, e.g. to apply a
quantity or repetition.
-
Commas are optional, e.g. the following patterns are equivalent:
[( letter )1 ( digit )2]
[( letter )1, ( digit )2]
-
By adding a number immediately after ")", you are in effect assigning the
contents of the group to a variable; the variable can be referred to using
"group#", described below. Note: if you need more than 26 variables,
please send us an example to illustrate why!
-
Parentheses must match, i.e. ")" always ends the most recent "(",
independent of number.
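Numbered groups and the mustBeginWith/mustEndWith keywords resemble back-references and look-around assertions in conventional regex. The Python sketch below is an assumed approximation. Note that Python's `re` module requires each look-behind to have a fixed width, so the two alternatives of different length are written as separate look-behinds.

```python
import re

# [(letter)1, group1] -> a captured letter followed by the same letter.
doubled = re.compile(r"([A-Za-z])\1")
print(doubled.findall("bookkeeper"))  # ['o', 'k', 'e']

# [mustBeginWith('hello' or 'goodbye') 'fred']
after_greeting = re.compile(r"(?:(?<=hello)|(?<=goodbye))fred")

# ['fred' mustEndWith('erick' or 'dy')]
before_suffix = re.compile(r"fred(?=erick|dy)")

# The surrounding text is checked but not included in the match.
print(after_greeting.search("goodbyefred").group())  # fred
print(before_suffix.search("frederick").group())     # fred
```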
EasyPattern allows comments to be included in multi-line patterns using the
character ';' or '#' to mark the start of a comment, extending until the end of
the line e.g.
[ 3 space ;look for 3 spaces
'hello' #then the keyword we want
]
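Conventional regex engines offer a similar facility. As a loose analogy (not EasyPattern syntax), Python's `re.VERBOSE` flag allows whitespace and '#' comments inside a pattern; a literal space then has to be written inside a class such as `[ ]`.

```python
import re

# Rough analog of the commented pattern above, using re.VERBOSE.
pattern = re.compile(
    r"""
    [ ]{3}   # look for 3 spaces
    hello    # then the keyword we want
    """,
    re.VERBOSE,
)

print(bool(pattern.search("   hello")))  # True
print(bool(pattern.search(" hello")))    # False
```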
whitespace (including items covered above) |
[space], [spaces] |
ASCII 32 |
[nonbreakingSpace] |
ASCII 202 |
[whitespace] |
[space OR tab OR cr OR lf OR verticalTab OR nonbreakingSpace] |
[tab] |
ASCII 9, \t |
[return], [cr] |
ASCII 13, \r |
[linefeed], [lf] |
ASCII 10, \n |
[verticalTab] |
ASCII 11 |
[formfeed] |
ASCII 12, \f |
[null] |
ASCII 0 |
[CRLF] |
[return, linefeed] |
[newline] |
[(return, linefeed) or return or linefeed] |
[DOSNewline] |
[return, linefeed] |
[UNIXNewline] |
[linefeed] |
[MacNewline] |
[return] |
-
"not" cannot be applied to [CRLF], [newline] or [DOSNewline] since they
either are or may be a character sequence rather than just a single character.
- A space character can usually be typed directly into a pattern ([ ' ' ]) but
using the keyword may make the pattern easier to understand (and modify later)
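The definition of [newline] tries the two-character sequence before the single characters, which matters when text mixes line-ending styles. A minimal Python sketch of the same idea, assuming the regex `\r\n|\r|\n` as an approximation of [newline]:

```python
import re

# Assumed approximation of [newline]: the CRLF pair is tried first so that
# it is not split into two separate line breaks.
NEWLINE = re.compile(r"\r\n|\r|\n")

mixed = "dos\r\nmac\runix\nend"
lines = NEWLINE.split(mixed)
print(lines)  # ['dos', 'mac', 'unix', 'end']
```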
whitespace combinations |
[horizontalWhitespace], [hSpace] |
[space or nonbreakingSpace or tab] |
[verticalWhitespace], [vSpace] |
[return or linefeed or formfeed or verticalTab] |
words, columns, lines & paragraphs |
[wordDelimiter] |
[space OR tab OR linefeed OR verticalTab OR formfeed OR
return] |
[wordChar] |
[not wordDelimiter] |
[word] |
[1+ wordChar] |
|
|
[columnDelimiter] |
[tab OR linefeed OR formfeed OR return] |
[columnChar] |
[not columnDelimiter] |
[column] |
[1+ columnChar] Note: Use [0+ columnChar] instead if
the column could be blank |
|
|
[lineDelimiter] |
[linefeed OR verticalTab OR formfeed OR return] |
[lineChar] |
[not lineDelimiter] |
[line] |
[1+ lineChar] Note: Use [0+ lineChar] instead if the
line could be blank |
|
|
[paragraphDelimiter] |
[formfeed OR return] |
[paragraphChar] |
[not paragraphDelimiter] |
[paragraph] |
[1+ paragraphChar] |
-
The above delimiters are characters not positions; they will
"consume" the character that they match. In contrast, [TextStart] and [TextEnd]
(below) are positions.
-
The above objects (word, column, line, paragraph) do not include
delimiters. So, to match multiple objects, you need to include the delimiters,
e.g.
[2+ word] -- won't match anything
[2+ (word, optional wordDelimiter)] -- correct
-
The definition for word is based strictly on whitespace so it will
include punctuation, matching text such as "$27.52" and "fancy+name". Although
in many cases it would be nice to exclude trailing punctuation, that pattern
would fail for text like "S.M.U.". When EP's definition of a word isn't
appropriate for your text, simply use the custom pattern that fits. For example,
"[1+ wordChar, letter or digit or symbol]" would ensure that the last char is not
punctuation.
-
Word, column, line & paragraph require one or more characters. If a
line might be empty, use: [0+ lineChar] instead of [line].
-
Because the definitions for word, column, line & paragraph look
for anything except the appropriate delimiter (rather than the leading
delimiter, a series of anything else, and the trailing delimiter), they can be
used to get the rest of a word, column, line & paragraph when the starting point
is already in the middle. See the example scripts for details.
-
These definitions allow control characters (except the specific whitespace used as delimiters) to appear in words, columns, lines & paragraphs.
-
A column may contain the verticalTab character. (It's used by
FileMaker to indicate line breaks within a field.)
-
Word, column, line & paragraph consist of multiple characters so
patterns like "[not word]" don't make sense.
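To see what a whitespace-based definition of a word means in practice, the sketch below approximates [word] with a negated class built from the [wordDelimiter] list above. It is an assumed Python equivalent; the sample text is invented.

```python
import re

# [wordDelimiter]: space, tab, linefeed, verticalTab, formfeed, return.
# [word] = [1+ wordChar], i.e. one or more characters that are not delimiters.
WORD = re.compile(r"[^ \t\n\v\f\r]+")

words = WORD.findall("Pay $27.52 to\tfancy+name, S.M.U.")
print(words)  # ['Pay', '$27.52', 'to', 'fancy+name,', 'S.M.U.']
```

Note how the trailing comma stays attached to "fancy+name,": punctuation is part of the word under this definition.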
positions |
[textStart] |
matches at start of entire text |
[textEnd] |
matches at end of the entire text or before newline at end |
[lineStart] |
matches the start of a line |
[lineEnd] |
matches the end of a line |
[wordBoundary] |
matches at a word boundary |
[notWordBoundary] |
matches when not at a word boundary |
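Positions match a location rather than a character. The Python sketch below shows the conventional regex anchors that play a similar role; the mapping is an assumption based on the descriptions above (for example, Python's `$` without `re.MULTILINE` also matches just before a newline at the very end of the text, which resembles the description of [textEnd]).

```python
import re

text = "first line\nsecond line\n"

# [lineStart] ~ ^ with re.MULTILINE; [textStart] ~ \A
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
text_start = re.findall(r"\A\w+", text)

# [textEnd] ~ $ without re.MULTILINE (end of text, or before a final newline)
last_word = re.search(r"\w+$", text).group()

# [wordBoundary] ~ \b
whole_words = re.findall(r"\bline\b", "line outline line")

print(line_starts)  # ['first', 'second']
print(text_start)   # ['first']
print(last_word)    # line
print(whole_words)  # ['line', 'line']
```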
combinations |
[controlChar] |
characters 0-31, 127
(careful: includes most whitespace) |
[gremlin] |
characters 0-31. The definition for [gremlin] is more cautious than in some products. |
[printableChar] |
[letter or digit or punctuation or symbol]
(anything that prints ink on paper) |
[typewriterChar] |
[printableChar or space or tab or return]
(excludes linefeed, vertical tab & formfeed) |
punctuation subsets (these items are included in
[punctuation]) |
[sentencePunctuation] |
.,;:!?¿¡ |
[anyBracket], [anyBrackets] |
left/right paren/bracket/brace
(i.e. "bracket" in the broad sense of the term) |
[anyQuote] |
[doubleQuote OR singleQuote OR backwardSingleQuote] |
[dash], [hyphen] |
-
Used interchangeably; we have adopted the common notion that these terms
refer to the same character |
[period] |
. |
[caret] |
^ |
[pound], [hash] |
# |
[slash] |
/ |
[backslash] |
\ |
[colon] |
: |
[percent] |
% |
[star], [asterisk] |
* |
[ampersand] |
& |
real-world patterns |
[HTMLTag] |
<[1+ not '>']> |
[HTMLStartTag] |
<[not '/', 0+ not '>']> (i.e. any tag except an end
tag) |
[HTMLEndTag] |
</[1+ not '>']> |
[QuotedString] |
[quote, 1+ ((backslash, quote) or not quote), quote] |
[SocialSecurityNumber] |
[3 digits, dash, 2 digits, dash, 4 digits] |
[PhoneNumber] |
Matches a US-style (xxx) xxx-xxxx number with a variety of punctuation
marks. The matching text is captured into 3 successive $variables |
[EmailAddress] |
Matches email addresses. The name and domain parts are captured into
2 successive $variables |
[IPAddress] |
Matches numeric IP addresses. The matching text is captured into 4
successive $variables |
[CreditCard] |
Matches credit card numbers with a variety of punctuation marks. The
matching text is captured into 4 successive $variables |
[Hyperlink] |
Matches a ftp, http, https, telnet, gopher or nntp internet url. The
matching text is captured into 3 successive $variables |
[DuplicateWord] |
Matches a repeated word. The matching text is captured into 2 successive
$variables |
[PageNumber] |
Matches a page number of the following forms:
Page dd
Page No dd
Page No. dd
Page Num. dd
Pg Num dd
Page Number dd
The matching text is captured into 3 successive $variables
(Page, Number, #) |
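Two of the definitions listed above, [SocialSecurityNumber] and [QuotedString], are spelled out fully, so they can be compared with conventional regex. The Python translations below are assumed approximations of those two definitions only; the other keywords in this table are not defined in enough detail here to translate.

```python
import re

# [3 digits, dash, 2 digits, dash, 4 digits]
SSN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

# [quote, 1+ ((backslash, quote) or not quote), quote]
# The backslash-quote alternative is tried first so escaped quotes stay inside.
QUOTED = re.compile(r'"(?:\\"|[^"])+"')

print(bool(SSN.fullmatch("123-45-6789")))                # True
print(QUOTED.search(r'say "a \"b\" c" now').group())     # "a \"b\" c"
```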
data processing patterns (in TextPipe 6.8.2 and later) |
[CSVfield] |
A Comma-Separated-Value field. If fields are delimited by single or double
quotes, embedded newlines are allowed, as are doubled-up quotes. The quotes
are returned as part of the match. |
[TABfield] |
A Tab-delimited field. To process multiple tab fields e.g.
[ 3+ ( TABfield tab) TABfield ] |
date and time patterns |
[Date] |
Matches a date format DD-MM-YY or DD-MMM-YY |
[AMPM] |
The AM/PM part of a time |
[Month] |
A MonthName or a MonthNumber |
[MonthNumber] |
1-12, with an optional leading zero |
[MonthName] |
January-December and Jan-Dec |
[Day] |
1-31 |
[DayOfYear] |
1..366 |
[Year] |
A 2 or 4 digit year (between 1800 and 2199) |
[Hour] |
A 12 or 24-hour hour, with optional leading zero |
[Minute] |
A 2 digit minute |
[Second] |
A 2 digit second |
Using the real world patterns above, you can easily construct the following
EasyPatterns:
HMS |
[ Hour <:.-> Minute <:.-> Second ] |
DMY |
[ Day <-/ > Month <-/ > Year ] |
MDY |
[ Month <-/ > Day <-/ > Year ] |
YMD |
[ Year <-/ > Month <-/ > Day ] |
Julian |
[ Year DayOfYear ] |
MY |
[ Month <-/ > Year ] |
MD |
[ Month <-/ > Day ] |
DM |
[ Day <-/ > Month ] |
HM |
[ Hour <:. > Minute ] |
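The same compose-from-parts approach works in conventional regex by concatenating sub-patterns. In the Python sketch below, the HOUR, MINUTE and SECOND regexes are guesses based on the one-line descriptions above, not the actual definitions used by EasyPattern.

```python
import re

# Hypothetical sub-patterns, guessed from the keyword descriptions.
HOUR = r"(?:[01]?[0-9]|2[0-3])"  # 12 or 24-hour, optional leading zero
MINUTE = r"[0-5][0-9]"           # 2 digit minute
SECOND = r"[0-5][0-9]"           # 2 digit second

# [ Hour <:.-> Minute <:.-> Second ]
HMS = re.compile(HOUR + "[:.-]" + MINUTE + "[:.-]" + SECOND)

print(bool(HMS.fullmatch("23:59:07")))  # True
print(bool(HMS.fullmatch("9.05.30")))   # True
print(bool(HMS.fullmatch("25:00:00")))  # False
```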
A complete pattern may include many individual keywords and
many expressions. How do you know which keywords go together and where one
expression stops and another begins? If in doubt, just enclose every expression
in parentheses. But, EasyPattern has rules for combining keywords into expressions,
so parentheses aren't always required. The traditional way of expressing these rules
is to list the "precedence" of various operators or terms:
-
(...), including numbered groups
-
[or] for character sets and single-character literals
-
[not]
-
quantity specifiers
-
character set keywords (e.g. letter, digit) and single-character
literals
-
multi-character literals
-
[or] as alternative, for groups and multi-character literals
Items with high precedence don't need parentheses; they group together
automatically. For example, let's build a pattern step-by-step using the "high
precedence" operators:
[letter or digit] « "or" for character set keywords
[letter or digit or '.'] « and single-character literal
[letter or digit or '.' or <!?>] « and arbitrary set
[not letter or digit or '.' or <!?>] « reverse the meaning with not
[1+ not letter or digit or '.' or <!?>] « add a quantity specifier
[1+ (not (letter or digit or '.' or <!?>))] « if you like
parentheses, though
the meaning is the same
Adding lower precedence terms before, after or both doesn't change the
grouping, though the expression is long enough that you may find a pair of
commas, brackets, or parentheses helpful. As long as you understand how EasyPattern
is doing the grouping, it doesn't matter whether you choose commas, brackets or
parentheses. If the parentheses are added around something that is already a group, they
don't change the meaning.
[punctuation 1+ not letter or digit or '.' or
<!?> symbol]
[punctuation, 1+ not letter or digit or '.' or <!?>, symbol] « same
meaning but easier to read
[punctuation][1+ not letter or digit or '.' or <!?>][symbol] « same
meaning
[punctuation (1+ not letter or digit or '.' or <!?>) symbol] « same
meaning
Remember, commas and brackets don't change the meaning, only the look. If you
put them in the middle of high precedence terms, you might confuse yourself:
[punctuation 1+ not letter][or][digit or '.'
or <!?> symbol] « same meaning but HARDER to read
[punctuation 1+ not letter, or, digit or '.' or <!?> symbol] « same
meaning but HARDER to read
Only parentheses change the meaning:
[(punctuation 1+ not letter) or (digit or '.'
or <!?> symbol)] « different meaning
Note that "or" for character sets and "or" as alternative have opposite
precedence. See
Character Sets and
Alternatives (above) for details &
examples.
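The step-by-step pattern built above collapses, in conventional regex, into a single negated class with a quantifier, which shows why the high-precedence terms behave as one unit. The Python regex below is an assumed ASCII-only approximation of [1+ not letter or digit or '.' or <!?>].

```python
import re

# [1+ not letter or digit or '.' or <!?>]
# The whole "or" chain forms one set; "not" negates it; "1+" repeats it.
run = re.compile(r"[^A-Za-z0-9.!?]+")

found = run.findall("ab, cd!! ef -- 12.")
print(found)  # [', ', ' ', ' -- ']
```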
EasyPattern vs. Perl regex or grep
At its core, EasyPattern uses "regular expression"
technology that is similar to the "regex" or "grep" tools that originated on
UNIX. EasyPattern's primary benefit is that the patterns are much easier to read
and write.
For those who have some experience with regex, here are a few specific
differences:
-
Quantity is specified as a prefix rather than a suffix. We believe
prefix notation is much more natural.
e.g. "[1+ digit]" rather than "[0-9]+"
-
Parentheses groups are not automatically numbered. Drawback (to some): you
have to include a number if you want to refer to that matched portion.
Benefits: the numbers don't change when you add other parentheses, you can
number only the groups that you want to use (the parentheses that are there
just for logical grouping don't get numbered).
-
No backslashes are required to "escape" special characters
(instead, EasyPattern provides keywords such as rightBracket). Benefit: Other pattern
languages already use backslash as an escape character so extra backslashes make
patterns even more difficult to read.
-
EasyPattern includes keywords for many character sets that require a custom
bracketed set in regex, e.g. punctuation, whitespace, paragraph, column, etc.
-
EasyPattern keywords generally include Macintosh-specific characters, e.g.
[letter] includes letters with umlauts and other diacritical marks
-
EasyPattern can combine character sets with "or" (as well as use "or" for
alternatives).
-
EP's [character] or [char] will match any character; the
"equivalent" in some products will match anything except carriage return. If you
want any character except a return (or formfeed), use [paragraphChar]; that is,
any character that could appear in a paragraph. Of course, [not return] works
too.
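The last point can be illustrated with one such product. In Python's `re` module, "." skips the linefeed character unless the `re.DOTALL` flag is given, while a negated class approximates [paragraphChar] as defined earlier ([not (formfeed or return)]). This is a sketch of the analogy only.

```python
import re

text = "a\nb"

dot_default = re.findall(r".", text)            # "." skips the linefeed
dot_all = re.findall(r".", text, re.DOTALL)     # closer to [character]
para_chars = re.findall(r"[^\f\r]", "a\rb\fc")  # approximates [paragraphChar]

print(dot_default)  # ['a', 'b']
print(dot_all)      # ['a', '\n', 'b']
print(para_chars)   # ['a', 'b', 'c']
```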